Extending the tool, or how to annotate historical language varieties

Authors

  • Cristina Sánchez Marco
  • Gemma Boleda
  • Lluís Padró
Abstract

We present a simple, general method for adapting an existing NLP tool so that it can handle historical varieties of a language. The approach consists of expanding the dictionary with old word variants and retraining the tagger on a small training corpus. We implement this approach for Old Spanish. A thorough evaluation of the extended tool shows that the method yields near state-of-the-art performance, adequate for quantitative studies in the humanities: 94.5% accuracy for the main part of speech and 92.6% for lemma. To our knowledge, this is the first time that such a strategy has been adopted to annotate historical language varieties, and we believe it could also be used for other non-standard varieties of languages.
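
The dictionary-expansion step described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy lexicon, the variant list, and the function name `expand_lexicon` are all assumptions made for the example. Each old spelling simply inherits the lemma and part-of-speech tag of its modern counterpart. The retraining step is not shown.

```python
# Toy modern lexicon: word form -> (lemma, part-of-speech tag).
# Entries are illustrative assumptions, not data from the paper.
MODERN_LEXICON = {
    "hacer": ("hacer", "V"),
    "caballo": ("caballo", "N"),
    "había": ("haber", "V"),
}

# Hypothetical hand-collected list: old spelling -> modern spelling.
OLD_VARIANTS = {
    "fazer": "hacer",
    "cauallo": "caballo",
    "auía": "había",
    "maguer": "aunque",  # modern form missing from the toy lexicon
}


def expand_lexicon(lexicon, variants):
    """Return a copy of `lexicon` in which every old variant whose
    modern form is known inherits that form's lemma and tag."""
    expanded = dict(lexicon)
    for old_form, modern_form in variants.items():
        # Skip variants we cannot analyse and never overwrite
        # an entry that already exists.
        if modern_form in lexicon and old_form not in expanded:
            expanded[old_form] = lexicon[modern_form]
    return expanded


EXPANDED = expand_lexicon(MODERN_LEXICON, OLD_VARIANTS)
```

With this sketch, looking up `EXPANDED["fazer"]` returns the analysis of modern `hacer`, while `maguer` stays out of the lexicon because its modern form has no entry to inherit from.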

Similar resources

Automatic discovery of Latin syntactic changes

Syntactic change tends to affect constructions, but treebanks annotate lower-level structure: PCFG rules or dependency arcs. This paper extends prior work in native language identification, using Tree Substitution Grammars to discover constructions which can be tested for historical variability. In a case study comparing Classical and Medieval Latin, the system discovers several constructions c...

WordNet.PT global – Extending WordNet.PT to Portuguese varieties

This paper reports the results of the WordNet.PTglobal project, an extension of WordNet.PT to all Portuguese varieties. Profiting from a theoretical model of high level explanatory adequacy and from a convenient and flexible development tool, WordNet.PTglobal achieves a rich and multipurpose lexical resource, suitable for contrastive studies and for a vast range of language-based applications c...

LEARNER INITIATIVES ACROSS QUESTION-ANSWER SEQUENCES: A CONVERSATION ANALYTIC ACCOUNT OF LANGUAGE CLASSROOM DISCOURSE

This paper investigates learner-initiated responses to English language teachers’ referential questions and learner initiatives after teachers’ feedback moves in meaning-focused question-answer sequences to analyze how interactional practices of language teachers, their initiation and feedback moves, facilitate learner initiatives. Classroom discourse research has largely neglected learner init...

Tagging Historical Corpora - the problem of spelling variation

Spelling issues tend to create relatively minor (though still complex) problems for corpus linguistics, information retrieval and natural language processing tasks that use ‘standard’ or modern varieties of English. For example, in corpus annotation, we have to decide how to deal with tokenisation issues such as whether (i) periods represent sentence boundaries or acronyms and (ii) apostrophes ...

Scientific Diagrams as Traces of Group-Dependent Cognition: A Brief Cognitive-Historical Analysis

Recent research has begun to explore the role of diagrams as cognitive tools. Here I develop new conceptual and methodological tools for exploring the sociality of cognition involving diagrams. First, I distinguish two varieties of group-dependent cognition. Second, extending Nersessian’s method of cognitive-historical analysis, I show how a suitably informed “literature review” of diagrams publi...

Publication date: 2011